NSF PAR Search | NSF Public Access Repository

Note: When clicking on a Digital Object Identifier (DOI) number, you will be taken to an external site maintained by the publisher. Some full text articles may not yet be available without a charge during the embargo (administrative interval).
What is a DOI Number?

Some links on this page may take you to non-federal websites. Their policies may differ from this site.

Unsupervised Partial Sentence Matching for Cited Text Identification

Ricci, Kathryn; Chang, Haw-Shiuan; Goyal, Purujit; McCallum, Andrew (October 2022, ACL Proceedings of the Third Workshop on Scholarly Document Processing)

Given a citation in the body of a research paper, cited text identification aims to find the sentences in the cited paper that are most relevant to the citing sentence. The task is fundamentally one of sentence matching, where affinity is often assessed by a cosine similarity between sentence embeddings. However, (a) sentences may not be well-represented by a single embedding because they contain multiple distinct semantic aspects, and (b) good matches may not require a strong match in all aspects. To overcome these limitations, we propose a simple and efficient unsupervised method for cited text identification that adapts an asymmetric similarity measure to allow partial matches of multiple aspects in both sentences. On the CL-SciSumm dataset we find that our method outperforms a baseline symmetric approach, and, surprisingly, also outperforms all supervised and unsupervised systems submitted to past editions of CL-SciSumm Shared Task 1a.
more » « less
Full Text Available
Softmax Bottleneck Makes Language Models Unable to Represent Multi-mode Word Distributions

https://doi.org/10.18653/v1/2022.acl-long.554

Chang, Haw-Shiuan; McCallum, Andrew (May 2022, Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics)

Neural language models (LMs) such as GPT-2 estimate the probability distribution over the next word by a softmax over the vocabulary. The softmax layer produces the distribution based on the dot products of a single hidden state and the embeddings of words in the vocabulary. However, we discover that this single hidden state cannot produce all probability distributions regardless of the LM size or training data size because the single hidden state embedding cannot be close to the embeddings of all the possible next words simultaneously when there are other interfering word embeddings between them. In this work, we demonstrate the importance of this limitation both theoretically and practically. Our work not only deepens our understanding of softmax bottleneck and mixture of softmax (MoS) but also inspires us to propose multi-facet softmax (MFS) to address the limitations of MoS. Extensive empirical analyses confirm our findings and show that against MoS, the proposed MFS achieves two-fold improvements in the perplexity of GPT-2 and BERT.
more » « less
Full Text Available
Augmenting Scientific Creativity with Retrieval across Knowledge Domains

Kang, Hyeonsu B.; Mysore, Sheshera; Huang, Kevin; Chang, Haw-Shiuan; Prein, Thorben; McCallum, Andrew; Kittur, Aniket; Olivetti, Elsa (January 2022, NLP+HCI Workshop at North American Chapter of the Association for Computational Linguistics 2022)

Exposure to ideas in domains outside a scientist's own may benefit her in reformulating existing research problems in novel ways and discovering new application domains for existing solution ideas. While improved performance in scholarly search engines can help scientists efficiently identify relevant advances in domains they may already be familiar with, it may fall short of helping them explore diverse ideas \textit{outside} such domains. In this paper we explore the design of systems aimed at augmenting the end-user ability in cross-domain exploration with flexible query specification. To this end, we develop an exploratory search system in which end-users can select a portion of text core to their interest from a paper abstract and retrieve papers that have a high similarity to the user-selected core aspect but differ in terms of domains. Furthermore, end-users can `zoom in' to specific domain clusters to retrieve more papers from them and understand nuanced differences within the clusters. Our case studies with scientists uncover opportunities and design implications for systems aimed at facilitating cross-domain exploration and inspiration.
more » « less
Full Text Available
Inorganic Materials Synthesis Planning with Literature-Trained Neural Networks

https://doi.org/10.1021/acs.jcim.9b00995

Kim, Edward; Jensen, Zach; van Grootel, Alexander; Huang, Kevin; Staib, Matthew; Mysore, Sheshera; Chang, Haw-Shiuan; Strubell, Emma; McCallum, Andrew; Jegelka, Stefanie; et al (March 2020, Journal of Chemical Information and Modeling)

Full Text Available
AutoKnow: Self-Driving Knowledge Collection for Products of Thousands of Types

https://doi.org/10.1145/3394486.3403323

Dong, Xin Luna; He, Xiang; Kan, Andrey; Li, Xian; Liang, Yan; Ma, Jun; Xu, Yifan Ethan; Zhang, Chenwei; Zhao, Tong; Blanco Saldana, Gabriel; et al (July 2020, KDD:20 The 26th {ACM} {SIGKDD} Conference on Knowledge Discovery and Data Mining)
null (Ed.)
Full Text Available

Search for: All records